Basic Tools of Multivariate Matching

Biostatistics Causal Inference Tutorial R

Observational study;
Causal Inference;
Matching;
R

Hai Nguyen
April 19, 2022

Motivation

A Small Example

Welders are exposed to chromium and nickel, substances that can cause inappropriate links between DNA and proteins, which in turn may disrupt gene expression or interfere with DNA replication. Costa, Zhitkovich, and Toniolo [10] measured DNA-protein cross-links in samples of white blood cells from 47 subjects (all male).

Unmatched data for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no). Response is DPC=DNA-protein cross-links in percent in white blood cells. All 47 subjects are male.

Welders — Controls

ID Age Race Smoker DPC ID Age Race Smoker DPC
1 38 C N 1.77 1 48 AA N 1.08
2 44 C N 1.02 2 63 C N 1.09
3 39 C Y 1.44 3 44 C Y 1.1
4 33 AA Y 0.65 4 40 C N 1.1
5 35 C Y 2.08 5 50 C N 0.93
6 39 C Y 0.61 6 52 C N 1.11
7 27 C N 2.86 7 56 C N 0.98
8 43 C Y 4.19 8 47 C N 2.2
9 39 C Y 4.88 9 38 C N 0.88
10 43 AA N 1.08 10 34 C N 1.55
11 41 C Y 2.03 11 42 C N 0.55
12 36 C N 2.81 12 36 C Y 1.04
13 35 C N 0.94 13 41 C N 1.66
14 37 C N 1.43 14 41 AA Y 1.49
15 39 C Y 1.25 15 31 AA Y 1.36
16 34 C N 2.97 16 56 AA Y 1.02
17 35 C Y 1.01 17 51 AA N 0.99
18 53 C N 2.07 18 36 C Y 0.65
19 38 C Y 1.15 19 44 C N 0.42
20 37 C N 1.07 20 35 C N 2.33
21 38 C Y 1.63 21 34 C Y 0.97
22 39 C Y 0.62
23 45 C N 1.02
24 42 C N 1.78
25 30 C N 0.95
26 35 C Y 1.59

Load dataset

# packages used throughout this post
library(readxl); library(dplyr); library(knitr); library(kableExtra)

# loading data
gen.tox <- read_excel("data/table81.xlsx", sheet = "data")

data.tab <- gen.tox[c(1:3, 22:24),]
kable(data.tab, caption = "Generic Toxicology Dataset (6-observation example)") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Generic Toxicology Dataset (6-observation example)
ID Group Age Race Smoker DPC
1 Welders 38 C N 1.77
2 Welders 44 C N 1.02
3 Welders 39 C Y 1.44
22 Controls 48 AA N 1.08
23 Controls 63 C N 1.09
24 Controls 44 C Y 1.10

Descriptive Statistics

m.age.w <- round(mean(gen.tox$Age[gen.tox$Group=="Welders"]),0)
m.age.c <- round(mean(gen.tox$Age[gen.tox$Group=="Controls"]),0)
aa.p.w <- round(table(gen.tox$Race[gen.tox$Group=="Welders"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Welders"]))*100,0)
aa.p.c <- round(table(gen.tox$Race[gen.tox$Group=="Controls"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Controls"]))*100,0)
s.p.w <- round(table(gen.tox$Smoker[gen.tox$Group=="Welders"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Welders"]))*100,0)
s.p.c <- round(table(gen.tox$Smoker[gen.tox$Group=="Controls"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Controls"]))*100,0)

Group              Mean Age   AA (%)   Smoker (%)
Welders                  38       10          52
Controls                 43       19          35
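
The same group summaries can also be computed in one pass with dplyr; a minimal sketch, not part of the original code:

gen.tox %>%
  group_by(Group) %>%
  summarize(mean.age   = mean(Age),
            pct.AA     = mean(Race == "AA") * 100,   # Race is still coded "AA"/"C" here
            pct.smoker = mean(Smoker == "Y") * 100)
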
t.test(Age ~ Group, data=gen.tox)

    Welch Two Sample t-test

data:  Age by Group
t = 2.2537, df = 42.167, p-value = 0.02947
alternative hypothesis: true difference in means between group Controls and group Welders is not equal to 0
95 percent confidence interval:
 0.4662145 8.4422104
sample estimates:
mean in group Controls  mean in group Welders 
              42.69231               38.23810 
tbl.age.group <- table(gen.tox$Group, gen.tox$Race)
tbl.age.group
          
           AA  C
  Controls  5 21
  Welders   2 19
chisq.test(tbl.age.group)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl.age.group
X-squared = 0.26754, df = 1, p-value = 0.605
fisher.test(tbl.age.group, conf.int = T, conf.level = 0.95)

    Fisher's Exact Test for Count Data

data:  tbl.age.group
p-value = 0.4364
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.3170452 25.9844336
sample estimates:
odds ratio 
   2.22477 

Notation

The genetic toxicity data have \(L\) = 47 subjects, \(l = 1, 2, \ldots, L\).

For subject \(l\), the observed covariate, \(\textbf{x}_l\), is 3-dimensional, \(\textbf{x}_l = (x_{l1}, x_{l2}, x_{l3})^T\) , where

  1. \(x_{l1}\) is the age

  2. \(x_{l2}\) encodes race, \(x_{l2}\) = 1 if \(l\) is African American, \(x_{l2}\) =0 if \(l\) is Caucasian,

  3. \(x_{l3}\) encodes smoking, \(x_{l3}\) = 1 if \(l\) is a current smoker, \(x_{l3}\) = 0 otherwise.

\(\rightarrow\) For instance, \(x_1\) = \((38,0,0)^T\) .

The variable \(Z_l\) distinguishes treated subjects from potential controls: \(Z_l\) = 1 for a treated subject, here a welder; \(Z_l\) = 0 for a potential control.

gen.tox$Race <- ifelse(gen.tox$Race=="C", 0, 1)
gen.tox$Smoker <- ifelse(gen.tox$Smoker=="N", 0, 1)

Propensity Score

A brief examination of the genetic toxicity data suggested that the welders tend to be younger and are more often current smokers than the potential controls, so the two groups are not directly comparable without adjustment for covariates.

In short,

  1. matching on \(e(\textbf{X})\) is often practical even when there are many covariates in x because \(e(\textbf{X})\) is a single variable,

  2. failure to balance \(e(\textbf{X})\) implies that \(X\) is not balanced

  3. matching on \(e(\textbf{X})\) tends to balance all of \(X\). On the negative side, success in balancing the observed covariates, \(X\), provides no assurance that unmeasured covariates are balanced

\(\rightarrow\) For example, having a father who worked as a welder is associated with a much higher chance that the son will also work as a welder, so father's occupation is an unmeasured covariate that could be used to predict treatment \(Z\); success in matching on the propensity score \(e(\textbf{X})\) would tend to balance age, race and smoking, but there is no reason to expect it to balance father's occupation.
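
As a quick numeric check of the balance points above, one can compute standardized mean differences of the observed covariates before any matching; a minimal sketch, not part of the original post:

# standardized mean difference (pooled-SD denominator) of each covariate,
# welders minus controls; values far from 0 indicate imbalance
std.diff <- function(x, z) {
  (mean(x[z == 1]) - mean(x[z == 0])) /
    sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
}
z <- as.numeric(gen.tox$Group == "Welders")
sapply(gen.tox[, c("Age", "Race", "Smoker")], std.diff, z = z)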

Estimating the PS

The PS is unknown, but it can be estimated from the data at hand. Here the PS is estimated by a linear logit model \[ \log {\frac{e(\textbf{X}_l)}{1 - e(\textbf{X}_l)}} = \zeta_0 + \zeta_1 x_{l1} + \zeta_2 x_{l2} + \zeta_3 x_{l3},\]

and the fitted values \(\hat{e}(\textbf{X}_l)\) from this model are the estimates of the PS.

#fit a propensity score model. logistic regression
gen.tox$Group <- relevel(as.factor(gen.tox$Group), ref="Controls")
psmodel <- glm(Group ~ Age + Race + Smoker, family=binomial(), data=gen.tox)

#show coefficients etc
summary(psmodel)

Call:
glm(formula = Group ~ Age + Race + Smoker, family = binomial(), 
    data = gen.tox)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4968  -1.0216  -0.5245   1.0786   1.8186  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.06543    2.14200   1.431    0.152
Age         -0.08503    0.05178  -1.642    0.101
Race        -0.76900    0.98252  -0.783    0.434
Smoker       0.55095    0.65212   0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64.623  on 46  degrees of freedom
Residual deviance: 58.706  on 43  degrees of freedom
AIC: 66.706

Number of Fisher Scoring iterations: 3
#create propensity score
pscore<-psmodel$fitted.values
gen.tox$ps <- round(pscore,2)

# mean of PS in Welders group
mean.ps.welder <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)

# mean PS among the 21 controls with the largest estimated PS
mean.ps.21largestcontrols <- gen.tox %>%
  filter(Group=="Controls") %>%
  top_n(21, ps) %>% # keep the 21 controls with the highest ps (top_n() wraps filter() and min_rank())
  arrange(desc(ps)) %>% # order by ps, decreasing
  summarize(round(mean(ps),2)) %>%
  as.numeric()

Estimated PS for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no).

ID Age Race Smoker DPC \(\hat{e}(\textbf{X})\) ID Age Race Smoker DPC \(\hat{e}(\textbf{X})\)
1 38 C N 1.77 0.46 1 48 AA N 1.08 0.14
2 44 C N 1.02 0.34 2 63 C N 1.09 0.09
3 39 C Y 1.44 0.57 3 44 C Y 1.1 0.47
4 33 AA Y 0.65 0.51 4 40 C N 1.1 0.42
5 35 C Y 2.08 0.65 5 50 C N 0.93 0.23
6 39 C Y 0.61 0.57 6 52 C N 1.11 0.2
7 27 C N 2.86 0.68 7 56 C N 0.98 0.15
8 43 C Y 4.19 0.49 8 47 C N 2.2 0.28
9 39 C Y 4.88 0.57 9 38 C N 0.88 0.46
10 43 AA N 1.08 0.2 10 34 C N 1.55 0.46
11 41 C Y 2.03 0.53 11 42 C N 0.55 0.54
12 36 C N 2.81 0.5 12 36 C Y 1.04 0.38
13 35 C N 0.94 0.52 13 41 C N 1.66 0.64
14 37 C N 1.43 0.48 14 41 AA Y 1.49 0.4
15 39 C Y 1.25 0.57 15 31 AA Y 1.36 0.35
16 34 C N 2.97 0.54 16 56 AA Y 1.02 0.55
17 35 C Y 1.01 0.65 17 51 AA N 0.99 0.13
18 53 C N 2.07 0.19 18 36 C Y 0.65 0.12
19 38 C Y 1.15 0.6 19 44 C N 0.42 0.64
20 37 C N 1.07 0.48 20 35 C N 2.33 0.34
21 38 C Y 1.63 0.6 21 34 C Y 0.97 0.52
22 39 C Y 0.62
23 45 C N 1.02
24 42 C N 1.78
25 30 C N 0.95
26 35 C Y 1.59
ps.w <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)
ps.c <- round(mean(gen.tox$ps[gen.tox$Group=="Controls"]),2)

Group              Mean Age   AA (%)   Smoker (%)   Mean \(\hat{e}(\mathbf{x})\)
Welders                  38       10          52    0.51
Controls                 43       19          35    0.40

Distance Matrices

Squared difference in the \(\hat{e}(\textbf{x})\) distance

ps.vec.c <- gen.tox$ps[gen.tox$Group=="Controls"]
ps.vec.w <- gen.tox$ps[gen.tox$Group=="Welders"]
d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    d.m[i,j] <- round((ps.vec.w[i] - ps.vec.c[j])^2, 2)
  }
}
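
The same matrix can also be built without loops; a small equivalent sketch using outer():

# squared differences of every welder PS against every control PS
d.m.alt <- round(outer(ps.vec.w, ps.vec.c, function(a, b) (a - b)^2), 2)
all.equal(d.m, d.m.alt)  # should be TRUE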

Table. Squared differences in PS between welders and controls: \((\hat{e}(\textbf{X})_{Welder} - \hat{e}(\textbf{X})_{Control})^2\)

Rows are the 21 welders and columns are for the first 6 of 26 potential controls.

Welder <- c(1:21)
d.m.d <- as.data.frame(d.m)
names(d.m.d) <- paste0("Control ", 1:26)
d.m.d.n <- as.data.frame(cbind(Welder, d.m.d))
kable(d.m.d.n[,1:7], caption = "Squared differences in `PS` between welders and controls")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Squared differences in PS between welders and controls
Welder Control 1 Control 2 Control 3 Control 4 Control 5 Control 6
1 0.10 0.14 0.00 0.00 0.05 0.07
2 0.04 0.06 0.02 0.01 0.01 0.02
3 0.18 0.23 0.01 0.02 0.12 0.14
4 0.14 0.18 0.00 0.01 0.08 0.10
5 0.26 0.31 0.03 0.05 0.18 0.20
6 0.18 0.23 0.01 0.02 0.12 0.14
7 0.29 0.35 0.04 0.07 0.20 0.23
8 0.12 0.16 0.00 0.00 0.07 0.08
9 0.18 0.23 0.01 0.02 0.12 0.14
10 0.00 0.01 0.07 0.05 0.00 0.00
11 0.15 0.19 0.00 0.01 0.09 0.11
12 0.13 0.17 0.00 0.01 0.07 0.09
13 0.14 0.18 0.00 0.01 0.08 0.10
14 0.12 0.15 0.00 0.00 0.06 0.08
15 0.18 0.23 0.01 0.02 0.12 0.14
16 0.16 0.20 0.00 0.01 0.10 0.12
17 0.26 0.31 0.03 0.05 0.18 0.20
18 0.00 0.01 0.08 0.05 0.00 0.00
19 0.21 0.26 0.02 0.03 0.14 0.16
20 0.12 0.15 0.00 0.00 0.06 0.08
21 0.21 0.26 0.02 0.03 0.14 0.16

A disadvantage of the PS distance: two controls with the same \(\hat{e}(\mathbf{x})\) may have quite different patterns of covariates \(\mathbf{x}\), and this is ignored.

\(\Rightarrow\) In the \(1^{st}\) row and the \(3^{rd}\) and \(4^{th}\) columns of the table above, the distance is 0 to two decimal places. Actually, the distances between welder #1 and potential controls #3 and #4 are, respectively, \((0.46−0.47)^2 = 0.0001\) and \((0.46−0.42)^2 = 0.0016,\) so control #3 is ever so slightly closer to welder #1.

However, in terms of the details of \(\textbf{x}\), control #4 looks to be the better match, a nonsmoker with a two-year difference in age, as opposed to control #3, a smoker with a six-year difference in age.

\(\rightarrow\) Because younger smokers are more common in the welder group, the PS is indifferent between a younger nonsmoker and an older smoker, but the details of \(\textbf{x}\) suggest that control #4 is a better match for welder #1 than is control #3.
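
That comparison can be reproduced directly from the rounded ps column created above; a small sketch:

ps.w1 <- gen.tox$ps[gen.tox$Group == "Welders"][1]   # welder 1: 0.46
ps.c  <- gen.tox$ps[gen.tox$Group == "Controls"]
c(control3 = (ps.w1 - ps.c[3])^2,   # (0.46 - 0.47)^2 = 1e-04
  control4 = (ps.w1 - ps.c[4])^2)   # (0.46 - 0.42)^2 = 0.0016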

Adding a caliper on the PS

An alternative distance (Paul R. Rosenbaum and Rubin 1985): use the Mahalanobis distance between \(\textbf{x}_k\) and \(\textbf{x}_l\) when their propensity scores are close, \(|\hat{e}(\textbf{x}_k) - \hat{e}(\textbf{x}_l)| \le w\), and an infinite (or heavily penalized) distance when the caliper \(w\) is violated.

sd.ps <- round(sd(pscore),3) # [1] 0.172
w <- round(sd.ps/2,3)
paste0("The standard deviation of ps: ", sd.ps)
[1] "The standard deviation of ps: 0.172"
paste0("The caliper w (half of the sd of ps): ", w, ", used to demonstrate in this post.")
[1] "The caliper w (half of the sd of ps): 0.086, used to demonstrate in this post."

In problems of practical size, a caliper of 20% of the sd of the PS is more common, and even that may be too large. A reasonable strategy is to start with a caliper of 20% of the sd of the PS \(\rightarrow\) then adjust the caliper if needed to obtain balance on the propensity score.
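
A small sketch of that strategy, not part of the original code: compute a 20% caliper and count how many potential controls each welder still has within it.

w20 <- 0.2 * sd(pscore)   # caliper = 20% of the sd of the estimated PS
avail <- rowSums(outer(ps.vec.w, ps.vec.c, function(a, b) abs(a - b) <= w20))
summary(avail)            # number of controls within the caliper, per welder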

Mahalanobis distance matrix

If \(\hat{\Sigma}\) is the sample covariance matrix of \(\textbf{x}\), then the estimated Mahalanobis distance between \(\textbf{x}_k\) and \(\textbf{x}_l\) is \[ (\textbf{x}_k - \textbf{x}_l)^T \hat{\Sigma}^{-1} (\textbf{x}_k - \textbf{x}_l) \]

var.m <-gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  unname() %>%
  as.matrix()

Sample var-cov matrix

# sample variance covariance matrix
cov.m <- cov(var.m)
round(cov.m,2)
      [,1] [,2]  [,3]
[1,] 54.04 0.39 -0.94
[2,]  0.39 0.13  0.02
[3,] -0.94 0.02  0.25

Inverse of sample var-cov matrix

# inverse of the sample variance-covariance matrix
# (inv() comes from an add-on package such as matlib; base::solve(cov.m) gives the same result)
cov.m.inv <- inv(cov.m)
round(cov.m.inv,3)
       [,1]   [,2]   [,3]
[1,]  0.021 -0.077  0.084
[2,] -0.077  8.127 -1.009
[3,]  0.084 -1.009  4.407

Mahalanobis distance matrix (Rubin 1980)

The distance matrix has 21 rows, one per treated subject (welder), and 26 columns, one per potential control; the value in row \(i\) and column \(j\) is the distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control.

var.m.w <-gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()

var.m.c <-gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()

m.d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.d.m[i,j] <- t(var.m.w[i,] - var.m.c[j,]) %*% cov.m.inv %*% (var.m.w[i,] - var.m.c[j,])
  }
}
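
A quick cross-check of a single entry with base R's mahalanobis(); it should reproduce m.d.m[1, 4], which appears as 0.08 after rounding in the table below:

# squared Mahalanobis distance between welder 1 and potential control 4
mahalanobis(var.m.c[4, ], center = var.m.w[1, ], cov = cov.m)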

# Mahalanobis distance
Welder <- c(1:21)
d.m.d.m <- as.data.frame(m.d.m)
names(d.m.d.m) <- paste0("Control ", 1:26)
d.d.m.d.n <- cbind(Welder, d.m.d.m)
kable(round(d.d.m.d.n[,1:7],2), caption = "Mahalanobis distance between the 21 welders and the 26 controls, based on Age, Race and Smoker")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 1: Mahalanobis distance between the 21 welders and the 26 controls, based on Age, Race and Smoker
Welder Control 1 Control 2 Control 3 Control 4 Control 5 Control 6
1 8.65 12.82 6.15 0.08 2.95 4.02
2 7.84 7.40 4.41 0.33 0.74 1.31
3 13.33 12.21 0.51 4.26 5.05 5.70
4 6.51 28.55 12.29 11.42 16.20 17.65
5 13.85 15.80 1.66 4.08 6.51 7.49
6 13.33 12.21 0.51 4.26 5.05 5.70
7 13.95 26.58 13.18 3.47 10.85 12.82
8 13.46 9.27 0.02 5.09 4.24 4.56
9 13.33 12.21 0.51 4.26 5.05 5.70
10 0.51 19.40 14.89 7.85 10.20 11.17
11 13.31 10.65 0.18 4.59 4.56 5.05
12 9.24 14.95 7.06 0.33 4.02 5.25
13 9.60 16.08 7.57 0.51 4.61 5.93
14 8.92 13.87 6.58 0.18 3.47 4.61
15 13.33 12.21 0.51 4.26 5.05 5.70
16 10.00 17.25 8.13 0.74 5.25 6.65
17 13.85 15.80 1.66 4.08 6.51 7.49
18 9.41 2.05 4.56 3.47 0.18 0.02
19 13.40 13.04 0.74 4.15 5.35 6.08
20 8.92 13.87 6.58 0.18 3.47 4.61
21 13.40 13.04 0.74 4.15 5.35 6.08

Another way to produce the Mahalanobis distance matrix is to use the mahal function, whose source code is shown below.

source("mahal.R")
print(mahal)
function (z, X) 
{
    X <- as.matrix(X)
    n <- dim(X)[1]
    rownames(X) <- 1:n
    k <- dim(X)[2]
    m <- sum(z)
    cv <- cov(X)
    out <- matrix(NA, m, n - m)
    Xt <- X[z == 1, ]
    Xc <- X[z == 0, ]
    rownames(out) <- rownames(X)[z == 1]
    colnames(out) <- rownames(X)[z == 0]
    require(MASS)
    icov <- ginv(cv)
    for (i in 1:m) {
        out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
    }
    out
}

Mahalanobis distance matrix within PS calipers

Rows are the 21 welders and columns are for the first 6 of 26 potential controls. An \(\infty\) signifies that the caliper is violated.

m.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m[i,j])
  }
}
Welder <- c(1:21)
d.m.m <- as.data.frame(m.m)
names(d.m.m) <- paste0("Control ", 1:26)
d.d.m.m <- cbind(Welder, d.m.m)
kable(round(d.d.m.m[,1:7],2), caption = "Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Mahalanobis distances within PS calipers. An Inf signifies that the caliper is violated.
Welder Control 1 Control 2 Control 3 Control 4 Control 5 Control 6
1 Inf Inf 6.15 0.08 Inf Inf
2 Inf Inf Inf 0.33 Inf Inf
3 Inf Inf Inf Inf Inf Inf
4 Inf Inf 12.29 Inf Inf Inf
5 Inf Inf Inf Inf Inf Inf
6 Inf Inf Inf Inf Inf Inf
7 Inf Inf Inf Inf Inf Inf
8 Inf Inf 0.02 5.09 Inf Inf
9 Inf Inf Inf Inf Inf Inf
10 0.51 Inf Inf Inf 10.20 11.17
11 Inf Inf 0.18 Inf Inf Inf
12 Inf Inf 7.06 0.33 Inf Inf
13 Inf Inf 7.57 Inf Inf Inf
14 Inf Inf 6.58 0.18 Inf Inf
15 Inf Inf Inf Inf Inf Inf
16 Inf Inf 8.13 Inf Inf Inf
17 Inf Inf Inf Inf Inf Inf
18 9.41 Inf Inf Inf 0.18 0.02
19 Inf Inf Inf Inf Inf Inf
20 Inf Inf 6.58 0.18 Inf Inf
21 Inf Inf Inf Inf Inf Inf

Cons of Mahalanobis distance

\(\rightarrow\) If a covariate is long-tailed or has outliers, its standard deviation will be inflated, and \(\rightarrow\) the Mahalanobis distance will tend to ignore that covariate in matching.

\(\rightarrow\) The Mahalanobis distance gives greater weight to binary variables with probabilities near 0 or 1 than to binary variables with probabilities closer to 1/2, because rarer binary variables have smaller variances.

\(\Longrightarrow\) In many contexts, rare binary covariates are not of overriding importance, and outliers do not make a covariate unimportant \(\rightarrow\) the Mahalanobis distance may not be appropriate with covariates of this kind.
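
A small numeric illustration of the binary-covariate point (ignoring correlations between covariates, so only the inverse variances matter):

p.rare <- 0.05; p.common <- 0.50
v.rare   <- p.rare * (1 - p.rare)       # variance of a rare 0/1 covariate: 0.0475
v.common <- p.common * (1 - p.common)   # variance of a balanced 0/1 covariate: 0.25
c(weight.rare = 1 / v.rare, weight.common = 1 / v.common)
# a 0/1 mismatch on the rare covariate is penalized roughly five times as heavily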

Solution: the rank-based Mahalanobis distance

  1. replaces each of the covariates, one at a time, by its ranks, with average ranks for ties,
    \(\Longrightarrow\) limits the influence of outliers.
  2. premultiplies and postmultiplies the covariance matrix of the ranks by a diagonal matrix whose diagonal elements are the ratios of the standard deviation of untied ranks, 1, …, L, to the standard deviations of the tied ranks of the covariates,
    \(\rightarrow\) the adjusted covariance matrix has a constant diagonal
    \(\Longrightarrow\) prevents heavily tied covariates, such as rare binary variables, from having increased influence due to reduced variance.
  3. computes the Mahalanobis distance using the ranks and this adjusted covariance matrix.
### Note: the covariance matrix is now computed from the rank-transformed covariates
data.r <- gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()
for (j in 1:dim(data.r)[2]){data.r[,j] <- rank(data.r[,j])}
cov.m.r <- cov(data.r)

## Step 1
var.m.w.r <- data.r[gen.tox$Group=="Welders",]
var.m.c.r <- data.r[gen.tox$Group=="Controls",]

## Step 2:
var.untied <- var(1:dim(gen.tox)[1])
rat <- sqrt(var.untied/diag(cov.m.r))

cov.m.r.adjusted <- diag(rat) %*% cov.m.r %*% diag(rat)
cov.m.r.adjusted.inv <- inv(cov.m.r.adjusted)

m.d.m.r <- matrix(NA, nrow = 21, ncol = 26)

for (i in 1:21){
  for (j in 1:26){
    m.d.m.r[i,j] <- t(var.m.w.r[i,] - var.m.c.r[j,]) %*% cov.m.r.adjusted.inv %*% (var.m.w.r[i,] - var.m.c.r[j,])
  }
}

# rank-based Mahalanobis distance
Welder <- c(1:21)
d.m.d.m.r <- as.data.frame(m.d.m.r)
names(d.m.d.m.r) <- paste0("Control ", 1:26)
d.d.m.d.m.r <- cbind(Welder, d.m.d.m.r)
kable(round(d.d.m.d.m.r[,1:7],2), caption = "Rank-based Mahalanobis distance between the 21 welders and the 26 controls, based on Age, Race and Smoker")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Rank-based Mahalanobis distance between the 21 welders and the 26 controls, based on Age, Race and Smoker
Welder Control 1 Control 2 Control 3 Control 4 Control 5 Control 6
1 4.61 4.40 5.98 0.33 2.69 3.21
2 2.98 0.70 3.21 0.47 0.15 0.28
3 6.91 4.62 0.84 3.04 3.66 3.93
4 8.14 14.80 10.43 7.69 12.17 13.00
5 9.03 8.49 3.93 3.66 6.55 7.15
6 6.91 4.62 0.84 3.04 3.66 3.93
7 10.20 12.31 12.86 3.93 9.30 10.26
8 6.77 3.29 0.04 3.92 2.99 3.05
9 6.91 4.62 0.84 3.04 3.66 3.93
10 0.25 4.72 7.61 3.03 3.72 4.01
11 6.71 3.79 0.28 3.38 3.18 3.34
12 5.86 6.33 7.61 0.98 4.24 4.89
13 6.98 7.96 9.02 1.68 5.59 6.33
14 5.25 5.41 6.83 0.64 3.49 4.08
15 6.91 4.62 0.84 3.04 3.66 3.93
16 8.30 9.78 10.61 2.56 7.12 7.96
17 9.03 8.49 3.93 3.66 6.55 7.15
18 3.33 0.05 3.00 1.68 0.05 0.01
19 7.34 5.62 1.58 2.99 4.34 4.72
20 5.25 5.41 6.83 0.64 3.49 4.08
21 7.34 5.62 1.58 2.99 4.34 4.72

Another way to produce the rank-based Mahalanobis distance matrix is to use the smahal function, whose source code is shown below.

source("smahal.R")
print(smahal)
function (z, X) 
{
    X <- as.matrix(X)
    n <- dim(X)[1]
    rownames(X) <- 1:n
    k <- dim(X)[2]
    m <- sum(z)
    for (j in 1:k) {
        X[, j] <- rank(X[, j])
    }
    cv <- cov(X)
    vuntied <- var(1:n)
    rat <- sqrt(vuntied/diag(cv))
    cv <- diag(rat) %*% cv %*% diag(rat)
    out <- matrix(NA, m, n - m)
    Xc <- X[z == 0, ]
    Xt <- X[z == 1, ]
    rownames(out) <- rownames(X)[z == 1]
    colnames(out) <- rownames(X)[z == 0]
    library(MASS)
    icov <- ginv(cv)
    for (i in 1:m) {
        out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
    }
    out
}

\(\Longrightarrow\) With the rank-based distance, a mismatch on race is counted as roughly equal to a mismatch on smoking.
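
A quick check of this claim, using the rank matrix data.r and the adjusted inverse covariance computed above: compare the contribution of a pure mismatch on Race with that of a pure mismatch on Smoker.

# rank gap between the two levels of a binary covariate (it equals n/2 = 23.5 here for both)
gap.race   <- diff(range(data.r[, "Race"]))
gap.smoker <- diff(range(data.r[, "Smoker"]))
d.race   <- c(0, gap.race, 0)      # difference vector in (Age, Race, Smoker) order
d.smoker <- c(0, 0, gap.smoker)
c(race   = t(d.race)   %*% cov.m.r.adjusted.inv %*% d.race,
  smoker = t(d.smoker) %*% cov.m.r.adjusted.inv %*% d.smoker)
# the two contributions should come out roughly equal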

Rank-based Mahalanobis distances within Propensity Score calipers. An Inf signifies that the caliper is violated

m.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m.r[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m.r[i,j])
  }
}
Welder <- c(1:21)
d.m.m.r <- as.data.frame(m.m.r)
names(d.m.m.r) <- paste0("Control ", 1:26)
d.d.m.m.r <- cbind(Welder, d.m.m.r)
kable(round(d.d.m.m.r[,1:7],2), caption = "rank-based Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Rank-based Mahalanobis distances within PS calipers. An Inf signifies that the caliper is violated.
Welder Control 1 Control 2 Control 3 Control 4 Control 5 Control 6
1 Inf Inf 5.98 0.33 Inf Inf
2 Inf Inf Inf 0.47 Inf Inf
3 Inf Inf Inf Inf Inf Inf
4 Inf Inf 10.43 Inf Inf Inf
5 Inf Inf Inf Inf Inf Inf
6 Inf Inf Inf Inf Inf Inf
7 Inf Inf Inf Inf Inf Inf
8 Inf Inf 0.04 3.92 Inf Inf
9 Inf Inf Inf Inf Inf Inf
10 0.25 Inf Inf Inf 3.72 4.01
11 Inf Inf 0.28 Inf Inf Inf
12 Inf Inf 7.61 0.98 Inf Inf
13 Inf Inf 9.02 Inf Inf Inf
14 Inf Inf 6.83 0.64 Inf Inf
15 Inf Inf Inf Inf Inf Inf
16 Inf Inf 10.61 Inf Inf Inf
17 Inf Inf Inf Inf Inf Inf
18 3.33 Inf Inf Inf 0.05 0.01
19 Inf Inf Inf Inf Inf Inf
20 Inf Inf 6.83 0.64 Inf Inf
21 Inf Inf Inf Inf Inf Inf

Optimal Pair Matching

Hansen's pairmatch function in his optmatch package makes Bertsekas' Fortran code for the assignment problem (a problem famously solved by Kuhn's Hungarian method in 1955) available from inside the statistical package R.

The SAS procedure proc assign also solves the assignment problem.

Match pair on PS

First, I wrote a simple greedy match: each welder is paired, in turn, with the still-unmatched control whose ps is closest. The marginal means turned out to be well balanced.
ctrl <- NULL
k <- NULL

d <- d.m.d.n[-1]  # 21 x 26 data frame of squared PS differences (drop the Welder id column)

for (i in 1:21){
  k <- which.min(d[i,])   # index of the closest remaining control for welder i
  ctrl[i] <- names(k)     # record that control's label
  d <- d[-k]              # drop the chosen control so it cannot be reused
}

index <- base::order(ctrl)

Welders <- gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker, ps)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps")]

Controls <- gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker, ps)

Controls$id <- paste0("Control ",1:26)

matched.controls <- Controls[which(Controls$id %in% ctrl),]

# Reorder to follow the ctrl match order (reorder.factor() is from the gdata package)
matched.controls$id <- reorder.factor(matched.controls$id, new.order=ctrl)
#matched.controls <- matched.controls[-5]

matched.controls <- matched.controls %>% dplyr::arrange(id)

kable(cbind(Welders,matched.controls), caption="Pair match on the closest PS (greedy match)")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table: Pair match on the closest PS (greedy match)
Pair Age Race Smoker ps Age Race Smoker ps id
Pair 1 38 0 0 0.46 44 0 1 0.47 Control 3
Pair 2 44 0 0 0.34 47 0 0 0.28 Control 8
Pair 3 39 0 1 0.57 34 0 0 0.54 Control 10
Pair 4 33 1 1 0.51 38 0 0 0.46 Control 9
Pair 5 35 0 1 0.65 36 0 1 0.64 Control 12
Pair 6 39 0 1 0.57 31 1 1 0.55 Control 15
Pair 7 27 0 0 0.68 36 0 1 0.64 Control 18
Pair 8 43 0 1 0.49 40 0 0 0.42 Control 4
Pair 9 39 0 1 0.57 35 0 0 0.52 Control 20
Pair 10 43 1 0 0.20 48 1 0 0.14 Control 1
Pair 11 41 0 1 0.53 39 0 1 0.57 Control 22
Pair 12 36 0 0 0.50 42 0 0 0.38 Control 11
Pair 13 35 0 0 0.52 41 0 0 0.40 Control 13
Pair 14 37 0 0 0.48 42 0 0 0.38 Control 24
Pair 15 39 0 1 0.57 30 0 0 0.63 Control 25
Pair 16 34 0 0 0.54 35 0 1 0.65 Control 26
Pair 17 35 0 1 0.65 34 0 1 0.67 Control 21
Pair 18 53 0 0 0.19 50 0 0 0.23 Control 5
Pair 19 38 0 1 0.60 41 1 1 0.35 Control 14
Pair 20 37 0 0 0.48 44 0 0 0.34 Control 19
Pair 21 38 0 1 0.60 45 0 0 0.32 Control 23
m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Group              Mean Age   AA (%)   Smoker (%)   Mean \(\hat{e}(\mathbf{x})\)
Welders                  38       10          52    0.51
Matched Controls         40       14          38    0.46

\(\Longrightarrow\) the marginal means of the matching above are well balanced, as expected

Next, I use the pairmatch function in the optmatch package for comparison.

library(optmatch)  # pairmatch(), fullmatch(), matched()
gen.tox.op <- gen.tox
gen.tox.op$Group1 <- as.numeric(ifelse(gen.tox.op$Group=="Welders", 1,0))
#pairmatch(match_on( Group~ps, data=gen.tox ) )# did not run
pm <- pairmatch(Group1 ~ ps, data = gen.tox.op)
print(pairmatch(Group1 ~ ps, data = gen.tox.op), grouped = TRUE) # show which treated and control ids end up paired
 Group Members
   1.1   1, 35
  1.10  18, 27
  1.11  19, 46
  1.12   2, 29
  1.13  20, 34
  1.14  21, 39
  1.15   3, 41
  1.16   4, 30
  1.17   5, 47
  1.18   6, 43
  1.19   7, 42
   1.2  10, 26
  1.20   8, 45
  1.21   9, 31
   1.3  11, 25
   1.4  12, 40
   1.5  13, 44
   1.6  14, 24
   1.7  15, 36
   1.8  16, 32
   1.9  17, 33
summary(pairmatch(Group1 ~ ps, data = gen.tox.op))
Structure of matched sets:
1:1 0:1 
 21   5 
Effective Sample Size:  21 
(equivalent number of matched pairs).
#all.equal(names(pm), row.names(gen.tox))

## Housekeeping
ipm <- as.integer(pm)
gen.tox.op <- cbind(gen.tox.op, matches=pm, ipm)
data.pairmatch<-gen.tox.op[matched(pm),] # only select matched cases

Welders <- data.pairmatch %>%
  filter(Group=="Welders") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm")]

matched.controls <- data.pairmatch %>%
  filter(Group=="Controls") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the squared difference in the PS")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 2: Optimal pair match using the squared difference in the PS
Pair Age Race Smoker ps ipm Age Race Smoker ps ipm
1 Pair 1 38 0 0 0.46 1 41 1 1 0.35 1
18 Pair 2 53 0 0 0.19 2 52 0 0 0.20 2
19 Pair 3 38 0 1 0.60 3 30 0 0 0.63 3
2 Pair 4 44 0 0 0.34 4 47 0 0 0.28 4
20 Pair 5 37 0 0 0.48 5 41 0 0 0.40 5
21 Pair 6 38 0 1 0.60 6 36 0 1 0.64 6
3 Pair 7 39 0 1 0.57 7 35 0 0 0.52 7
4 Pair 8 33 1 1 0.51 8 38 0 0 0.46 8
5 Pair 9 35 0 1 0.65 9 35 0 1 0.65 9
6 Pair 10 39 0 1 0.57 10 39 0 1 0.57 10
7 Pair 11 27 0 0 0.68 11 34 0 1 0.67 11
10 Pair 12 43 1 0 0.20 12 50 0 0 0.23 12
8 Pair 13 43 0 1 0.49 13 42 0 0 0.38 13
9 Pair 14 39 0 1 0.57 14 34 0 0 0.54 14
11 Pair 15 41 0 1 0.53 15 40 0 0 0.42 15
12 Pair 16 36 0 0 0.50 16 44 0 0 0.34 16
13 Pair 17 35 0 0 0.52 17 45 0 0 0.32 17
14 Pair 18 37 0 0 0.48 18 44 0 1 0.47 18
15 Pair 19 39 0 1 0.57 19 31 1 1 0.55 19
16 Pair 20 34 0 0 0.54 20 42 0 0 0.38 20
17 Pair 21 35 0 1 0.65 21 36 0 1 0.64 21
m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Group              Mean Age   AA (%)   Smoker (%)   Mean \(\hat{e}(\mathbf{x})\)
Welders                  38       10          52    0.51
Matched Controls         40       10          38    0.46

\(\Longrightarrow\) the marginal means of the matching above are fairly well balanced, but the individual pairs are often not matched for race or smoking.

Match pair on Mahalanobis distance within PS calipers

The caliper is half the standard deviation of the PS, or 0.172/2 = 0.086

z<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# Mahalanobis distances
dmat<-mahal(z,x)

# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distances penalized for violations of a propensity score caliper.
# This version is used for numerical work.
# (addcaliper() is assumed sourced like mahal.R/smahal.R; caliper = .5 corresponds to half the sd of the PS, 0.086 here.)
dmat<-addcaliper(dmat,z,prop,caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)

ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 3: Optimal pair match using the Mahalanobis distance within PS calipers
Pair Age Race Smoker ps ipm.cal Age Race Smoker ps ipm.cal
1 Pair 1 38 0 0 0.46 1 44 0 0 0.34 1
18 Pair 2 53 0 0 0.19 2 52 0 0 0.20 2
19 Pair 3 38 0 1 0.60 3 34 0 0 0.54 3
2 Pair 4 44 0 0 0.34 4 47 0 0 0.28 4
20 Pair 5 37 0 0 0.48 5 42 0 0 0.38 5
21 Pair 6 38 0 1 0.60 6 31 1 1 0.55 6
3 Pair 7 39 0 1 0.57 7 36 0 1 0.64 7
4 Pair 8 33 1 1 0.51 8 41 1 1 0.35 8
5 Pair 9 35 0 1 0.65 9 35 0 1 0.65 9
6 Pair 10 39 0 1 0.57 10 35 0 0 0.52 10
7 Pair 11 27 0 0 0.68 11 30 0 0 0.63 11
10 Pair 12 43 1 0 0.20 12 48 1 0 0.14 12
8 Pair 13 43 0 1 0.49 13 45 0 0 0.32 13
9 Pair 14 39 0 1 0.57 14 36 0 1 0.64 14
11 Pair 15 41 0 1 0.53 15 44 0 1 0.47 15
12 Pair 16 36 0 0 0.50 16 41 0 0 0.40 16
13 Pair 17 35 0 0 0.52 17 40 0 0 0.42 17
14 Pair 18 37 0 0 0.48 18 42 0 0 0.38 18
15 Pair 19 39 0 1 0.57 19 39 0 1 0.57 19
16 Pair 20 34 0 0 0.54 20 38 0 0 0.46 20
17 Pair 21 35 0 1 0.65 21 34 0 1 0.67 21
m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Group              Mean Age   AA (%)   Smoker (%)   Mean \(\hat{e}(\mathbf{x})\)
Welders                  38       10          52    0.51
Matched Controls         40       14          38    0.45

Match pair on rank-based Mahalanobis distance within PS calipers

z<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# rank-based Mahalanobis distances
dmat<-smahal(z,x)

# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distances penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat<-addcaliper(dmat,z,prop,caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)

ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the rank-based Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 4: Optimal pair match using the rank-based Mahalanobis distance within PS calipers
Pair Age Race Smoker ps ipm.cal Age Race Smoker ps ipm.cal
1 Pair 1 38 0 0 0.46 1 44 0 0 0.34 1
18 Pair 2 53 0 0 0.19 2 52 0 0 0.20 2
19 Pair 3 38 0 1 0.60 3 31 1 1 0.55 3
2 Pair 4 44 0 0 0.34 4 47 0 0 0.28 4
20 Pair 5 37 0 0 0.48 5 42 0 0 0.38 5
21 Pair 6 38 0 1 0.60 6 34 0 0 0.54 6
3 Pair 7 39 0 1 0.57 7 35 0 0 0.52 7
4 Pair 8 33 1 1 0.51 8 41 1 1 0.35 8
5 Pair 9 35 0 1 0.65 9 35 0 1 0.65 9
6 Pair 10 39 0 1 0.57 10 36 0 1 0.64 10
7 Pair 11 27 0 0 0.68 11 30 0 0 0.63 11
10 Pair 12 43 1 0 0.20 12 48 1 0 0.14 12
8 Pair 13 43 0 1 0.49 13 45 0 0 0.32 13
9 Pair 14 39 0 1 0.57 14 36 0 1 0.64 14
11 Pair 15 41 0 1 0.53 15 44 0 1 0.47 15
12 Pair 16 36 0 0 0.50 16 41 0 0 0.40 16
13 Pair 17 35 0 0 0.52 17 40 0 0 0.42 17
14 Pair 18 37 0 0 0.48 18 42 0 0 0.38 18
15 Pair 19 39 0 1 0.57 19 39 0 1 0.57 19
16 Pair 20 34 0 0 0.54 20 38 0 0 0.46 20
17 Pair 21 35 0 1 0.65 21 34 0 1 0.67 21
m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Group              Mean Age   AA (%)   Smoker (%)   Mean \(\hat{e}(\mathbf{x})\)
Welders                  38       10          52    0.51
Matched Controls         40       14          38    0.45

Optimal Matching with Multiple Controls

In matching with multiple controls, each treated subject is matched to at least one, and possibly more than one, control.

The decision to match in a fixed or a variable ratio affects both the bias and the variance (efficiency) of the resulting comparison.

Matching with a variable number of controls is slightly more complex.

The principal advantage of matching in a fixed ratio is its simplicity.

Matching with a variable ratio has several advantages (Paul R. Rosenbaum 1989); in particular, it can yield substantial gains in bias reduction (Ming and Rosenbaum 2000).

Finding an optimal match with variable controls is equivalent to solving a particular minimum-cost flow problem in a network (Paul R. Rosenbaum 1989).
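
A minimal sketch of matching with a variable number of controls, assuming the penalized distance matrix dmat and gen.tox from the sections above: each welder receives one or two controls, and since there are 21 welders and 26 potential controls, at most 5 welders can receive a second control.

# optimal match with between 1 and 2 controls per welder;
# fullmatch() decides which welders get a second control so as to minimize total distance
fm.12 <- optmatch::fullmatch(dmat, min.controls = 1, max.controls = 2, data = gen.tox)
summary(fm.12)   # structure of the matched sets (how many 1:1 vs 1:2)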

Optimal Full Matching

# full match on dmat (the rank-based, caliper-penalized distances from the previous section);
# sets may contain up to 7 welders per control (min.controls = 1/7) or up to 7 controls per welder
m<-fullmatch(dmat, data=gen.tox, min.controls = 1/7,max.controls = 7)
length(m)
[1] 47
sum(matched(m))
[1] 47
# Housekeeping
im<-as.integer(m)
gen.tox.fm<-cbind(gen.tox,im)
g.t.m<-gen.tox[matched(m),]

Welders <- gen.tox.fm %>%
  filter(Group=="Welders") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)
#Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Age","Race","Smoker","ps","im")]

fmatched.controls <- gen.tox.fm %>%
  filter(Group=="Controls") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)

kable(list(Welders,fmatched.controls), caption="Optimal full match using the rank-based Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 5: Optimal full match using the rank-based Mahalanobis distance within PS calipers
Age Race Smoker ps im
38 0 0 0.46 1
53 0 0 0.19 2
38 0 1 0.60 3
44 0 0 0.34 4
38 0 1 0.60 5
39 0 1 0.57 6
39 0 1 0.57 6
39 0 1 0.57 6
41 0 1 0.53 6
39 0 1 0.57 6
33 1 1 0.51 7
35 0 1 0.65 8
27 0 0 0.68 9
43 1 0 0.20 10
43 0 1 0.49 11
36 0 0 0.50 12
35 0 0 0.52 12
37 0 0 0.48 13
37 0 0 0.48 13
34 0 0 0.54 14
35 0 1 0.65 15
Age Race Smoker ps im
40 0 0 0.42 1
63 0 0 0.09 2
50 0 0 0.23 2
52 0 0 0.20 2
56 0 0 0.15 2
36 0 1 0.64 3
47 0 0 0.28 4
42 0 0 0.38 4
41 0 0 0.40 4
41 1 1 0.35 4
44 0 0 0.34 4
45 0 0 0.32 4
42 0 0 0.38 4
36 0 1 0.64 5
39 0 1 0.57 6
31 1 1 0.55 7
35 0 1 0.65 8
30 0 0 0.63 9
48 1 0 0.14 10
56 1 1 0.13 10
51 1 0 0.12 10
44 0 1 0.47 11
35 0 0 0.52 12
38 0 0 0.46 13
34 0 0 0.54 14
34 0 1 0.67 15

Efficiency

(to be continued)

Summary

Matching on one variable, the propensity score, tends to produce treated and control groups that, in aggregate, are balanced with respect to observed covariates;

\(\rightarrow\) Cons: individual pairs that are close on the propensity score may differ widely on specific covariates.

\(\rightarrow\) Solution: to form closer pairs, use a distance that penalizes large differences on the propensity score, and then find individual pairs that are as close as possible on that distance.

Pairs or matched sets are constructed using an optimization algorithm. Matching with variable controls and full matching combine elements of matching with elements of direct adjustment.

Full matching can often produce closer matches than pair matching.

Further Reading

Chapter 9, "Basic Tools of Multivariate Matching," from Rosenbaum, Paul R. Design of Observational Studies, 2nd ed., Springer International Publishing, 2020, https://doi.org/10.1007/978-3-030-46405-9.

Mahalanobis, P. C. 1936. “On the Generalized Distance in Statistics.” Proceedings of Indian National Science Academy 2 (1): 49–55.
Ming, K., and P. R. Rosenbaum. 2000. “Substantial Gains in Bias Reduction from Matching with a Variable Number of Controls.” Biometrics 56 (1): 118–24. https://doi.org/10.1111/j.0006-341X.2000.00118.x.
Rosenbaum, P. R. 1991. “A Characterization of Optimal Designs for Observational Studies.” Journal of the Royal Statistical Society, Series B (Methodological) 53 (3): 597–610. www.jstor.org/stable/2345589.
Rosenbaum, Paul R. 1989. “Optimal Matching for Observational Studies.” Journal of the American Statistical Association 84 (408): 1024–32. https://doi.org/10.2307/2290079.
Rosenbaum, Paul R., and Donald B. Rubin. 1985. “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score.” The American Statistician 39 (1): 33–38. https://doi.org/10.2307/2683903.
Rubin, Donald B. 1980. “Bias Reduction Using Mahalanobis-Metric Matching.” Biometrics 36 (2): 293–98. https://doi.org/10.2307/2529981.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2022, April 19). HaiBiostat: Basic Tools of Multivariate Matching. Retrieved from https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/

BibTeX citation

@misc{nguyen2022basic,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Basic Tools of Multivariate Matching},
  url = {https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/},
  year = {2022}
}